Abstract

Estimator algorithms in learning automata are useful tools for adaptive, real-time
optimization in computer science and engineering applications. This paper
investigates theoretical convergence properties for a special case of estimator
algorithms, the pursuit learning algorithm. In this note, we identify and fill a gap
in existing proofs of probabilistic convergence for pursuit learning. It is traditional
to take the pursuit learning tuning parameter to be fixed in practical applications,
but our proof sheds light on the importance of a vanishing sequence of tuning
parameters in a theoretical convergence analysis.

Keywords and phrases: Convergence in probability; indirect estimator algorithms;
learning automata.



1 Introduction 



A learning automaton consists of an adaptive learning agent operating in an unknown
random environment (Narendra and Thathachar 1989). In a nutshell, a learning automaton
has a choice among a finite set of actions to take, with one such action being optimal
in the sense that it has the highest probability of producing a reward from the
environment. This optimal action is unknown and the automaton uses feedback from the
environment to try to identify the optimal action. Applications of learning automata
include game theory, pattern recognition, computer vision, and routing in communications
networks. Recently, learning automata have been used for call routing in ATM networks
(Atlasis et al. 2000), multiple access channel selection (Zhong et al. 2010), congestion
avoidance in wireless networks (Misra et al. 2009), channel selection in radio networks
(Tuan et al. 2010), modeling of students' behavior (Oommen and Hashemi 2010), clustering
and backbone formation in ad-hoc wireless networks (Torkestani and Meybodi 2010), power
system stabilizers (Kashki et al. 2010), and spectrum allocation in cognitive networks
(Lixia et al. 2010). The simplest type of learning automata applies a direct algorithm,
such as the linear reward-inaction algorithm (Narendra and Thathachar 1989), which uses
only the environmental feedback at iteration $t$ to update the preference ordering of
the actions. A drawback to using direct algorithms is their slow rate of convergence.
Attention recently has focused on the faster indirect estimator algorithms. What sets
indirect algorithms apart from their direct counterparts is that they use the entire
history of environmental feedback, i.e., from iteration 1 to $t$, to update the action
preference ordering at iteration $t$. It is this more efficient use of the environmental
feedback which leads to faster convergence.

Here we consider a special case of indirect estimator algorithms, the pursuit learning
algorithm, and, in particular, the version presented by Rajaraman and Sastry (1996).
Starting with vacuous information about the unknown reward probabilities, pursuit
learning adaptively samples actions and tracks the empirical reward probabilities for
each action. As the algorithm progresses, the sampling probabilities for the set of
actions are updated in a way consistent with the relative magnitudes of the empirical
reward probabilities; see Section 2.1. Simulations demonstrate that the algorithm is fast
to converge in a number of different estimation scenarios (Lanctôt and Oommen 1992;
Oommen and Lanctôt 1990; Sastry 1985). The algorithm is said to converge if the sampling
probability for the action with the highest reward probability becomes close to 1 as the
number of iterations increases.

In the learning automata literature, $\varepsilon$-optimality is the gold standard for
theoretical convergence. But there seem to be two different notions of
$\varepsilon$-optimality that appear in the estimator algorithm literature. The version
that appears in the direct algorithm context (e.g., the linear reward-inaction
algorithms) is in some sense weaker than that which appears in the indirect algorithm
context. The latter is essentially convergence in probability of the dominant action
sampling probability to 1 as the number of iterations approaches infinity. Section 2.2
describes these two modes of stochastic convergence in more detail, but our focus is on
the latter convergence in probability version.

The main goal of this paper is to identify and fill a gap in existing proofs of
$\varepsilon$-optimality for pursuit learning. We believe that it is important to throw
light on this gap because there are relatively recent papers proposing new algorithms
that simply copy verbatim these incomplete arguments. Specifically, in many proofs, the
weak law of large numbers is incorrectly interpreted as giving a bound on the probability
that the sample path stays inside a fixed neighborhood of its target forever after some
fixed iteration. It is true that any finite-dimensional properties of the sample path can
be handled via the weak law of large numbers, but the word "forever" implies that
countably many time instances must be dealt with and, hence, more care must be taken. A
detailed explanation of the gap in existing proofs is presented in Section 2.3. In
Section 3 we give a new proof of convergence in probability for pursuit learning with
some apparently new arguments. A further consequence of our analysis relates to the
algorithm's tuning parameter. Indeed, it is standard to assume, in both theory and
practice, that the algorithm's tuning parameter is a small but fixed quantity. However,
our analysis suggests that it is necessary to consider a sequence of tuning parameters
that vanish at a certain rate.
2 Pursuit learning algorithm



2.1 Notation and statement of the algorithm 

Suppose a learning automaton has a finite set of actions $A = \{a_1, \ldots, a_r\}$. If
the automaton plays action $a_i$, then it earns a reward with probability $d_i$;
otherwise, it gets a penalty. An estimator algorithm tracks this reward/penalty
information with the goal of identifying the optimal action, the one having the largest
reward probability $d_i$. Pursuit learning, described below, is one such algorithm.

At iteration $t$, the automaton selects an action $a(t) \in A$ with respective
probabilities $\pi(t) = \{\pi_1(t), \ldots, \pi_r(t)\}$. When this action is played, the
environment produces an outcome $X(t) \in \{0, 1\}$ that satisfies

$$d_i = E\{X(t) \mid a(t) = a_i\}, \quad i = 1, \ldots, r.$$

As the algorithm proceeds and the various actions are tried, the automaton acquires more
and more information about the $d_i$'s indirectly through the $X$'s. In other words,
estimates $\hat d(t)$ of $d$ at time $t$ can be used to update the sampling probabilities
$\pi(t)$ in such a way that those actions with large $\hat d_i(t)$ are more likely to be
chosen again in the next iteration. Algorithm 1 gives the details.

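For comparison, the direct linear reward-inaction algorithm updates $\pi(t)$ according
to the following rule: If $a(t) = a_i$, then

$$\pi_j(t) = \begin{cases}
\pi_j(t-1) + \lambda X(t)\,[1 - \pi_j(t-1)] & \text{if } j = i, \\
\pi_j(t-1) - \lambda X(t)\,\pi_j(t-1) & \text{if } j \neq i.
\end{cases}$$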



It is clear that this direct linear reward-inaction algorithm does not make efficient use
of the full environmental history $X(1), \ldots, X(t)$ up to and including iteration $t$.
For this reason, it suffers from slower convergence than that of the indirect pursuit
learning algorithm. In fact, Thathachar and Sastry (1985) demonstrate, via simulations,
that an indirect algorithm requires roughly 87% fewer iterations than a direct algorithm
to achieve the same level of precision.
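As a small illustration of the direct update rule stated above, the following sketch applies one linear reward-inaction step to a probability vector. The function name, the 0-based action index, and the example values are assumptions made for this illustration only; they do not come from the cited papers.

```python
def linear_reward_inaction_step(pi, chosen, reward, lam):
    """Return the updated sampling probabilities after one L_RI step.

    pi     : current probabilities pi(t-1), one entry per action
    chosen : index i of the action a(t) that was played (0-based, assumed)
    reward : X(t), equal to 1 for a reward and 0 for a penalty
    lam    : fixed step size lambda in (0, 1)
    """
    return [
        p + lam * reward * (1.0 - p) if j == chosen else p - lam * reward * p
        for j, p in enumerate(pi)
    ]


pi = [0.25, 0.25, 0.25, 0.25]
# A reward moves probability mass toward the chosen action ...
pi_rewarded = linear_reward_inaction_step(pi, chosen=2, reward=1, lam=0.1)
# ... while a penalty leaves the vector unchanged ("inaction").
pi_penalized = linear_reward_inaction_step(pi, chosen=2, reward=0, lam=0.1)
```

Only the most recent outcome $X(t)$ enters the update, which is the sense in which the rule ignores the earlier history.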

One might also notice that the pursuit learning algorithm is not unlike the popular
stochastic approximation methods introduced in Robbins and Monro (1951) and discussed in
detail in Kushner and Yin (2003). But a convergence analysis of pursuit learning using
the powerful ordinary differential equation techniques seems particularly challenging due
to the discontinuity of the $\delta_{m(t)}$ component in the Step 2(c) update.

The internal parameter $\lambda$ controls the size of steps that can be made in moving
from $\pi(t-1)$ to $\pi(t)$. In general, small values of $\lambda$ correspond to slower
rates of convergence, and vice versa. In our asymptotic results, we follow Tilak et al.
(2011) and actually take $\lambda = \lambda_t$ to change with $t$. They argue that a
changing $\lambda$ is consistent with the usual notion of convergence (see also
Section 2.2), and does not necessarily conflict with the practical choice of small fixed
$\lambda$. In what follows, we will assume that

$$\lambda_t = 1 - \theta^{1/t}, \quad \text{for some fixed } \theta \in (e^{-1}, 1), \tag{1}$$
although all that is necessary is that - as t — > oo. 
Algorithm 1: Pursuit Learning.

1. For $i = 1, \ldots, r$, set

   $$\pi_i(0) = 1/r \quad \text{and} \quad N_i(0) = 0,$$

   and initialize $\hat d_i(0)$ by playing action $a_i$ a few times and recording the
   proportion of rewards. Set $t = 1$.

2. (a) Sample $a(t)$ according to $\pi(t-1)$, and observe $X(t)$ drawn from its
   conditional distribution given $a(t)$.

   (b) For $i = 1, \ldots, r$, update

   $$N_i(t) = \begin{cases} N_i(t-1) + 1 & \text{if } a(t) = a_i, \\ N_i(t-1) & \text{if } a(t) \neq a_i, \end{cases}$$

   which denotes the number of times action $a_i$ has been tried up to and including
   iteration $t$, and

   $$\hat d_i(t) = \begin{cases} \hat d_i(t-1) + \{X(t) - \hat d_i(t-1)\}/N_i(t) & \text{if } a(t) = a_i, \\ \hat d_i(t-1) & \text{if } a(t) \neq a_i, \end{cases}$$

   and then compute

   $$m(t) = \arg\max\{\hat d_1(t), \ldots, \hat d_r(t)\}.$$

   (c) Update

   $$\pi(t) = (1 - \lambda)\,\pi(t-1) + \lambda\,\delta_{m(t)},$$

   where $\delta_j$ is an $r$-vector whose $j$th entry is 1 and the others 0.

3. Set $t \leftarrow t + 1$ and return to Step 2.
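The steps of the algorithm can be sketched in a few lines of code. The following is a minimal simulation under stated assumptions, not the authors' implementation: the function name, the default values (`theta=0.5`, `n_iter=2000`, `n_init=5`), the reward probabilities in the usage line, and the running-mean form of the estimate update are all choices made here for illustration. The tuning parameter follows the vanishing sequence $\lambda_t = 1 - \theta^{1/t}$ used in the asymptotic analysis.

```python
import random


def pursuit_learning(d, theta=0.5, n_iter=2000, n_init=5, seed=0):
    """Simulate pursuit learning for reward probabilities d (hypothetical helper).

    Returns the final sampling probabilities, the estimates d_hat, and the
    play counts N_i. Assumes e^{-1} < theta < 1.
    """
    rng = random.Random(seed)
    r = len(d)
    pi = [1.0 / r] * r      # Step 1: uniform sampling probabilities
    counts = [0] * r        # Step 1: N_i(0) = 0
    # Step 1: rough initial estimates from n_init trial plays per action
    # (assumed scheme; the first real play of an action overwrites its guess).
    d_hat = [sum(rng.random() < d[i] for _ in range(n_init)) / n_init
             for i in range(r)]

    for t in range(1, n_iter + 1):
        lam = 1.0 - theta ** (1.0 / t)               # vanishing tuning parameter
        a = rng.choices(range(r), weights=pi)[0]     # Step 2(a): sample an action
        x = 1 if rng.random() < d[a] else 0          # reward (1) or penalty (0)
        counts[a] += 1                               # Step 2(b): update N_a(t)
        d_hat[a] += (x - d_hat[a]) / counts[a]       # running proportion of rewards
        m = max(range(r), key=lambda i: d_hat[i])    # index of the largest estimate
        pi = [(1.0 - lam) * p for p in pi]           # Step 2(c): shrink every entry
        pi[m] += lam                                 # and pursue action m(t)
    return pi, d_hat, counts


pi, d_hat, counts = pursuit_learning([0.8, 0.6, 0.4, 0.2])
```

Because every entry of `pi` is multiplied by $1 - \lambda_t$ before $\lambda_t$ is added to one entry, the vector remains a probability distribution at each iteration.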



2.2 Convergence and $\varepsilon$-optimality

Convergence of an estimator algorithm like pursuit learning implies that, eventually, the
automaton will always play the optimal action. In other words, if $d_1$ is the largest
among the $d_i$'s, then $\pi_1(t)$ gets close to 1, in some sense, as $t \to \infty$.
This convergence is typically called $\varepsilon$-optimality, although there appears to
be no widely agreed upon definition.

In the context of indirect estimator algorithms, the following is perhaps the most
common definition of $\varepsilon$-optimality. We shall henceforth assume, without loss
of generality, that action $a_1$ is the unique dominant action, i.e., $d_1$ is the
largest of the $d_i$'s.

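Definition 1. The pursuit learning algorithm is $\varepsilon$-optimal if, for any
$\varepsilon, \delta > 0$, there exists $T^* = T^*(\varepsilon, \delta)$ and
$\lambda^* = \lambda^*(\varepsilon, \delta)$ such that

$$P\{\pi_1(t) > 1 - \varepsilon\} > 1 - \delta, \tag{2}$$

for all $t > T^*$ and $\lambda < \lambda^*$. Simply put, the algorithm has the
$\varepsilon$-optimality property if $\pi_1(t) \to 1$ in probability as
$(t, \lambda) \to (\infty, 0)$.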
This is the definition of $\varepsilon$-optimality that appears in Agache and Oommen
(2002) and the references mentioned in Section 1. Thathachar and Sastry (1985) say an
algorithm that satisfies Definition 1 is optimal in probability, arguably a better
adjective. However, a different notion of $\varepsilon$-optimality can be found in other
contexts. This one says that the algorithm is $\varepsilon$-optimal if, for any
$\varepsilon > 0$, there exists a fixed $\lambda > 0$ so that
$\liminf_{t \to \infty} \pi_1(t) > 1 - \varepsilon$ with probability 1. Compared to
Definition 1, this latter definition is, on one hand, stronger because the condition is
"with probability 1" but, on the other hand, weaker because it does not even require
$\pi_1(t)$ to converge. Since one will not, in general, imply the other, it is unclear
which definition is to be preferred. Oommen and Lanctôt (1990) and others have recognized
the difference between the two, but apparently no explanation has been given for choosing
one over the other.

Since both $T^*$ and $\lambda^*$ in Definition 1 are linked together through the choice
of $(\varepsilon, \delta)$, it is intuitively clear that $\lambda$ should decrease with
$t$. In fact, allowing $\lambda$ to change with $t$ appears to be necessary in the proof
presented in Section 3.2. So, throughout this paper, our notion of
$\varepsilon$-optimality will be that (2) holds for all $t > T^*$ with the particular
(vanishing) sequence of tuning parameters $\{\lambda_t\}$ in (1).



2.3 Existing proofs of $\varepsilon$-optimality

Here we shall identify the gap in existing proofs of $\varepsilon$-optimality for pursuit
learning. Focus will fall primarily on the proof in Rajaraman and Sastry (1996), but this
is just for concreteness and not to single out these particular authors. In fact,
essentially the same gap appears in Papadimitriou et al. (2004); there is a similar
mis-step in other papers which we mention briefly below. The outline of these proofs goes
roughly as follows:



Step 1. Show that $N_i(t) \to \infty$ in probability for each $i = 1, \ldots, r$ as
$t \to \infty$. That is, show that for any large $n$ and small $\delta$, there exists
$T^*$ such that

$$P\Bigl\{\min_{i=1,\ldots,r} N_i(t) > n\Bigr\} > 1 - \delta, \quad \forall\, t > T^*. \tag{3}$$

Step 2. Show that for any small $\delta$ and $\rho$, there exists $n$ such that

$$P\Bigl\{\max_{i=1,\ldots,r} |\hat d_i(t) - d_i| < \rho \;\Big|\; \min_{i=1,\ldots,r} N_i(t) > n\Bigr\} > 1 - \delta. \tag{4}$$

Rajaraman and Sastry (1996) apply the famous inequality of Hoeffding (1963) to get an
expression on the right-hand side that approaches 1 exponentially fast in $n$. A similar
idea is used in Section 3.2.

Step 3. Reason from (3) that, for sufficiently small $\rho$ and for all $t$ larger than
some $T^*$, $\hat d_1(t)$ will be the largest among the $\hat d_i(t)$'s with probability
at least $1 - \delta$.

Step 4. Apply the monotonicity property (Lanctôt and Oommen 1992) to show that
$\pi_1(t)$ increases monotonically to 1 starting from some $t > T^*$ and must, therefore,
eventually cross the $1 - \varepsilon$ threshold.



The trouble with this line of reasoning emerges in Step 3, and is a consequence of an
incorrect interpretation of the law of large numbers. Roughly speaking, what is needed
in Step 3 is a control of the entire $\hat d(t)$ process over an infinite time horizon,
$t > T^*$, but the law of large numbers alone can provide control at only finitely many
time instances. More precisely, from Steps 1 and 2 and the law of large numbers one can
reason that

$$P\{\hat d_1(t) \text{ is the largest of the } \hat d_i(t)\text{'s}\} > 1 - \delta, \quad \forall\, t > T^*. \tag{5}$$

But even though the left-hand side above is monotone increasing in $t$, one cannot
conclude directly from this fact that

$$P\{\hat d_1(t) \text{ is the largest of the } \hat d_i(t)\text{'s for all } t > T^*\} > 1 - \delta. \tag{6}$$

Rajaraman and Sastry (1996) implicitly assume that (5) implies (6) in their proof of
$\varepsilon$-optimality. A slightly different oversight is made in Thathachar and Sastry
(1987), Oommen and Lanctôt (1990), and Lanctôt and Oommen (1992). They assume that

$$P\{\pi_1(t) > 1 - \varepsilon \mid \hat d_1(t) \text{ is the largest of the } \hat d_i(t)\text{'s}\}$$

can be made arbitrarily close to 1 for large enough $t$. However, the knowledge that
$\hat d_1(t)$ is the largest of the $\hat d_i(t)$'s only at time $t$ provides no control
over how close $\pi_1(t)$ is to 1. The monotonicity property in Step 4 requires that the
$\hat d_i(t)$'s be properly ordered forever, not just at a single point in time.

It will be insightful to have a clearer picture of what the problem is mathematically.
First, the left-hand side of (6) is, in general, much smaller than the left-hand side of
(5), so the claim "(5) $\Rightarrow$ (6)" immediately seems questionable. In fact, if
$E_t$ is the event that $\hat d_1(t)$ is the largest of the $\hat d_i(t)$'s at time $t$,
then from (5) we can conclude that

$$\liminf_{t \to \infty} P\{E_t\} \geq 1 - \delta. \tag{7}$$

But the event inside $P\{\cdots\}$ in (6) is

$$\bigcap_{t > T^*} E_t \;\subseteq\; \bigcup_{T^* \geq 1} \bigcap_{t > T^*} E_t \;=\; \liminf_{t \to \infty} E_t,$$

and it follows from Fatou's lemma that

$$\text{left-hand side of (6)} \leq P\{\liminf E_t\} \leq \liminf P\{E_t\}. \tag{8}$$

So, from (7) and (8), we can conclude only that the left-hand side of (6) is bounded from
above by something greater than $1 - \delta$ and, hence, (5) need not imply (6).
Therefore, some pursuit learning-specific considerations are needed and, to the authors'
knowledge, there is no obvious way to fill this gap. In the next section we give a proof
of $\varepsilon$-optimality based on some apparently new arguments.
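To see concretely why a bound that holds at each single time point need not extend to all time points at once, consider a toy sequence of independent events with $P(E_t) = 1 - 1/t$. This sequence is a hypothetical example introduced here only for illustration and has nothing to do with the dependence structure of pursuit learning. Each $P(E_t)$ tends to 1, yet the probability that $E_t$ holds for every $t$ in a window $T < t \le M$ is the telescoping product $T/M$, which vanishes as $M$ grows.

```python
from fractions import Fraction


def prob_all_hold(T, M):
    """P(E_{T+1} and ... and E_M) for independent events with P(E_t) = 1 - 1/t."""
    p = Fraction(1)
    for t in range(T + 1, M + 1):
        p *= 1 - Fraction(1, t)   # independence: multiply the marginals
    return p


T = 100
# Marginal probability of a single event just after T: already above 0.99.
single = 1 - Fraction(1, T + 1)
# Joint probability over longer and longer windows: equals T / M exactly.
joint = {M: prob_all_hold(T, M) for M in (200, 1000, 10000)}
```

In this toy case the pointwise probabilities exceed 0.99 for every $t > 100$, while the probability that the events hold simultaneously over the window shrinks toward 0 as the window lengthens.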



3 A refined analysis of pursuit learning 

3.1 An infinite-series result 

Here we state an infinite-series result which will be useful in our analysis in
Section 3.2. For completeness, a proof is given in Appendix A.

Lemma 1. Given $a, b \in (0, 1)$, let $\zeta(t) = (1 - a/t^b)^t$, $t \geq 1$. Then
$\sum_{t=1}^{\infty} \zeta(t) < \infty$.

It is easy to see that the condition $b \in (0, 1)$ is necessary. Indeed, if $b = 1$,
then the sequence itself converges to $e^{-a}$ and the series cannot hope to converge. In
our pursuit learning application below, this condition will be taken care of in our
choice of tuning parameter sequence $\lambda_t$.
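The role of the exponent $b$ in Lemma 1 can be checked numerically. The sketch below uses illustrative values ($a = 0.5$, $b = 0.5$) chosen here, not taken from the paper: for $b < 1$ the partial sums stop growing, whereas for $b = 1$ the terms themselves settle near $e^{-a}$ and so cannot be summable.

```python
import math


def zeta(t, a, b):
    """The sequence zeta(t) = (1 - a / t**b)**t from Lemma 1."""
    return (1.0 - a / t ** b) ** t


def partial_sum(a, b, n_terms):
    """Sum of zeta(1), ..., zeta(n_terms)."""
    return sum(zeta(t, a, b) for t in range(1, n_terms + 1))


a = 0.5
# b in (0, 1): doubling the number of terms barely changes the partial sum.
s_2000 = partial_sum(a, 0.5, 2000)
s_4000 = partial_sum(a, 0.5, 4000)
# b = 1: the terms approach exp(-a) > 0 instead of vanishing.
term_b1 = zeta(10 ** 6, a, 1.0)
limit_b1 = math.exp(-a)
```

A numerical check of this kind only illustrates the lemma; the proof in Appendix A is what establishes convergence.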



3.2 Main results 



We start by summarizing a few known results from the literature (see, e.g., Tilak et al.
2011) which will be needed in the proof of the main theorem. Recall the notation $N_i(t)$
used for the number of times, up to iteration $t$, that action $a_i$ has been tried,
$i = 1, \ldots, r$. The first result is that all of the $N_i(t)$'s are unbounded in
probability as the number of iterations $t$ increases to $\infty$.

Lemma 2. Suppose $\lambda_t$ satisfies (1) with $e^{-1} < \theta < 1$. Then for any small
$\delta > 0$ and any $K > 0$, there exists $T^*$ such that, for each $i = 1, \ldots, r$,

$$P\{N_i(t) \leq K\} < \delta, \quad \forall\, t > T^*.$$

As the number of times each action is played is increasing to $\infty$, it is reasonable
to think that the estimates, namely the $\hat d_i(t)$'s, should be approaching their
respective targets, the $d_i$'s. It turns out that this intuition is indeed correct.

Lemma 3. Suppose $\lambda_t$ satisfies (1) with $e^{-1} < \theta < 1$. Then for any small
$\delta > 0$ and any small $\eta > 0$, there exists $T^*$ such that, for each
$i = 1, \ldots, r$,

$$P\{|\hat d_i(t) - d_i| > \eta\} < \delta, \quad \forall\, t > T^*.$$

An alternative way to phrase the previous two lemmas is that, under the stated
conditions, $N_i(t)$ and $\hat d_i(t)$ converge in probability to $\infty$ and $d_i$,
respectively, as $t \to \infty$. Next is the main $\varepsilon$-optimality result.

Theorem 1. Suppose $\lambda_t$ satisfies (1) with $e^{-1} < \theta < 1$. Then for any
small $\varepsilon, \delta > 0$, there exists $T^*$ such that

$$P\{\pi_1(t) > 1 - \varepsilon\} > 1 - \delta, \quad \forall\, t > T^*.$$



To prove Theorem 1, we shall initially follow the argument of Rajaraman and Sastry
(1996). To simplify notation, define the events

$$A_\varepsilon(t) = \{\pi_1(t) > 1 - \varepsilon\}, \quad t \geq 1.$$

Next, we observe that, since the reward probabilities are fixed, there is a number
$\eta > 0$ such that, if $|\hat d_1(t) - d_1| < \eta$, then $\hat d_1(t)$ must be the
largest of the estimates $\hat d(t)$ at iteration $t$. For this $\eta$, define the two
sequences of events

$$B(t) = \{|\hat d_1(t) - d_1| < \eta\}, \quad t \geq 0,$$

$$\bar B(T) = \Bigl\{\sup_{t \geq T} |\hat d_1(t) - d_1| < \eta\Bigr\} = \bigcap_{t \geq T} B(t), \quad T \geq 0.$$

Then, for any positive integers $t$ and $T$, the law of total probability gives

$$P\{A_\varepsilon(t + T)\} \geq P\{A_\varepsilon(t + T) \mid \bar B(T)\}\, P\{\bar B(T)\}.$$

Moreover, from the monotonicity property of pursuit learning, it follows that there
exists $T_\varepsilon^*$ such that

$$P\{A_\varepsilon(t + T) \mid \bar B(T)\} = 1, \quad \forall\, t > T_\varepsilon^*.$$

Therefore, to complete the proof, it remains to show that there exists $T > 0$ such that
$P\{\bar B(T)\} > 1 - \delta$. But by De Morgan's law,

$$P\{\bar B(T)\} = P\Bigl\{\bigcap_{t \geq T} B(t)\Bigr\} = 1 - P\Bigl\{\bigcup_{t \geq T} B(t)^c\Bigr\},$$

so we are done if we can find $T > 0$ such that

$$P\Bigl\{\bigcup_{t \geq T} B(t)^c\Bigr\} < \delta. \tag{9}$$

Towards this, write $N(t) = N_1(t)$ and note that

$$P\Bigl\{\bigcup_{t \geq T} B(t)^c\Bigr\} \leq \sum_{t \geq T} P\{B(t)^c\}
= \sum_{t \geq T} \Bigl( \sum_{n=0}^{t} P\{B(t)^c \mid N(t) = n\}\, P\{N(t) = n\} \Bigr).$$

It follows easily from Hoeffding's inequality that

$$P\{B(t)^c \mid N(t) = n\} = P\{|\hat d_1(t) - d_1| \geq \eta \mid N(t) = n\} \leq e^{-hn},$$

where $h = \eta^2/8 > 0$ is a constant independent of $n$. Therefore,

$$P\Bigl\{\bigcup_{t \geq T} B(t)^c\Bigr\}
\leq \sum_{t \geq T} \Bigl( \sum_{n=0}^{t} P\{B(t)^c \mid N(t) = n\}\, P\{N(t) = n\} \Bigr)
\leq \sum_{t \geq T} \Bigl( \sum_{n=0}^{t} e^{-hn}\, P\{N(t) = n\} \Bigr),$$

and the inner-most sum is easily seen to be the moment generating function, call it
$\psi_t(u)$, of the random variable $N(t)$ evaluated at $u = -h$. To prove that this sum
is finite, we must show that $\psi_t(-h)$ vanishes sufficiently fast in $t$.

Formulae for moment generating functions of standard random variables are readily
available. But $N(t)$ is not a standard random variable; it is like a Bernoulli
convolution (Klenke and Mattner 2010; Proschan and Sethuraman 1976) but the summands are
only conditionally Bernoulli. In Lemma 4 below we show that $\psi_t(u)$, for $u < 0$, is
bounded above by a certain binomial random variable's moment generating function.

Lemma 4. Consider a binomial random variable with parameters $(t, \omega_t)$, where
$\omega_t = \pi_1(0)\,\theta^{\gamma(t)}$ and $\gamma(t) = \sum_{s=1}^{t} s^{-1}$. If
$\varphi_t$ is the corresponding moment generating function, then
$\psi_t(u) \leq \varphi_t(u)$ for $u < 0$.
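Proof. Let $N(t) = \sum_{s=1}^{t} \xi(s)$, where $\xi(s) = 1$ if the optimal action is
sampled at iteration $s$ and 0 otherwise. If $\mathcal{A}_s$ denotes the $\sigma$-algebra
generated by the history of the algorithm up to and including iteration $s$, then
$\xi(s)$ satisfies

$$E\{\xi(s) \mid \mathcal{A}_{s-1}\} = \pi_1(s-1).$$

For $u < 0$, the moment generating function $\psi_t(u)$ of $N(t)$ satisfies

$$\psi_t(u) = E\{e^{uN(t)}\} = E\{e^{u[\xi(1) + \cdots + \xi(t)]}\}
= E\Bigl\{\prod_{s=1}^{t} E\bigl(e^{u\xi(s)} \mid \mathcal{A}_{s-1}\bigr)\Bigr\}
= E\Bigl\{\prod_{s=1}^{t} \bigl[1 - \pi_1(s-1)(1 - e^{u})\bigr]\Bigr\}.$$

But Rajaraman and Sastry (1996) show that
$\omega_t \leq \min\{\pi_1(1), \ldots, \pi_1(t)\}$, so

$$\psi_t(u) \leq \prod_{s=1}^{t} \bigl[1 - \omega_t(1 - e^{u})\bigr] = \bigl[1 - \omega_t(1 - e^{u})\bigr]^{t}.$$

But the right-hand side above is exactly $\varphi_t(u)$, completing the proof. □

Back to our main discussion, we now have that

$$P\Bigl\{\bigcup_{t \geq T} B(t)^c\Bigr\} \leq \sum_{t \geq T} \psi_t(-h) \leq \sum_{t \geq T} \varphi_t(-h),$$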



and it is well-known that the moment generating function $\varphi_t$ for the binomial
random variable satisfies

$$\varphi_t(-h) = \bigl[1 - \omega_t(1 - e^{-h})\bigr]^{t} = \bigl[1 - \pi_1(0)(1 - e^{-h})\,\theta^{\gamma(t)}\bigr]^{t}.$$

But the sequence $\gamma(t)$ grows like $\ln(t)$, and $\theta^{\ln(t)} = t^{\ln(\theta)}$,
so for large $t$

$$\varphi_t(-h) \sim \Bigl[1 - \frac{\pi_1(0)(1 - e^{-h})}{t^{-\ln(\theta)}}\Bigr]^{t}.$$

The right-hand side above is just the sequence $\zeta(t)$ defined in Lemma 1 with

$$a = \pi_1(0)(1 - e^{-h}) \quad \text{and} \quad b = -\ln(\theta).$$

Therefore, the series $\sum_{t \geq 1} \varphi_t(-h)$ converges so, for any $\delta$,
there exists $T$ such that $\sum_{t \geq T} \varphi_t(-h) < \delta$, thus proving (9). To
put everything together, let $T_\delta^*$ be the smallest $T$ with
$\sum_{t \geq T} \varphi_t(-h) < \delta$. Then Theorem 1 follows by taking
$T^* = T_\varepsilon^* + T_\delta^*$.
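The step that replaces $\theta^{\gamma(t)}$ by $t^{\ln(\theta)}$ can be checked numerically. The sketch below, with an illustrative value $\theta = 0.5$ chosen here, computes the ratio of the two quantities. Since $\gamma(t) - \ln(t)$ tends to the Euler-Mascheroni constant, the ratio settles at a fixed positive constant, which affects only the multiplicative constant in front of $t^{\ln(\theta)}$ and leaves the exponent $b = -\ln(\theta)$ unchanged.

```python
import math


def harmonic(t):
    """gamma(t) = 1 + 1/2 + ... + 1/t."""
    return sum(1.0 / s for s in range(1, t + 1))


theta = 0.5  # illustrative value inside (exp(-1), 1)
# Ratio theta**gamma(t) / t**ln(theta) for increasing t.
ratios = {t: theta ** harmonic(t) / t ** math.log(theta)
          for t in (10, 1000, 100000)}
# Limiting value of the ratio: theta raised to the Euler-Mascheroni constant.
limit = theta ** 0.5772156649015329
```

For $\theta \in (e^{-1}, 1)$ the exponent $b = -\ln(\theta)$ lies in $(0, 1)$, which is the range Lemma 1 requires.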

A natural question is whether one can give a deterministic bound for $T^*$ in terms of
the user-specified $\varepsilon$, $\delta$, and $\theta$. An affirmative answer to this
question is given in Appendix B. We choose not to give much emphasis to this result, as
the bound we obtain appears to be quite conservative. For example, for numerical
experiments run under the setup in Simulation 1 of Thathachar and Sastry (1985), we find
that more than 95% of sample paths converge in roughly 25-250 iterations, while our
conservative theoretical bounds are, for moderate $\theta$, orders of magnitude greater.
4 Discussion



In this paper we have taken a closer look at convergence properties of pursuit learning.
In particular, we have identified a gap in existing proofs of $\varepsilon$-optimality
and provided a new argument to fill this gap. An important consequence of our theoretical
analysis is that it seems necessary to explicitly specify the rate at which the tuning
parameter sequence $\lambda = \lambda_t$ vanishes with $t$. In fact, if $\omega_t$
defined in Lemma 4 vanishes too quickly, which it would if $\lambda_t = \lambda$, then
$\sum_t \varphi_t(-h) = \infty$ and the proof fails. But we should also reiterate that a
theoretical analysis that requires vanishing $\lambda_t$ need not conflict with the
tradition of running the algorithm with fixed small $\lambda$ in practical applications.
In fact, the particular $\lambda_t$ vanishes relatively fast so, for applications, we
recommend running pursuit learning with, say,

where $t_0$ is some fixed cutoff, and $x^+ = \max\{0, x\}$. This effectively keeps
$\lambda_t$ constant for a fixed finite period of time, after which it vanishes like that
in (1). Alternatively, one might consider $\lambda_t = 1 - \theta^{v(t)}$, where $v(t)$
vanishes more slowly than $t^{-1}$. This choice of $\lambda_t$ vanishes more slowly than
that in (1), thus giving the algorithm more opportunities to adjust to the environment.
We believe that an analysis similar to ours can be used to show that the corresponding
pursuit learning converges in the sense of Definition 1.
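The remark that the series argument breaks down for a fixed tuning parameter can be made explicit. The display below is a sketch under an assumption made here: for fixed $\lambda$, the analogue of the binomial success probability in Lemma 4 (written $\omega_t$) is taken to be $\pi_1(0)(1-\lambda)^t$, the product of $t$ factors $1 - \lambda$. The symbol $q$ is illustrative notation, not the paper's.

```latex
% Fixed tuning parameter: \lambda_t = \lambda for all t, with q = 1 - \lambda \in (0, 1).
\omega_t = \pi_1(0)\, q^{t}, \qquad
\varphi_t(-h) = \bigl[ 1 - \pi_1(0)\,(1 - e^{-h})\, q^{t} \bigr]^{t}.
% Because t q^{t} \to 0 as t \to \infty,
\log \varphi_t(-h) = t \,\log\bigl( 1 - \pi_1(0)\,(1 - e^{-h})\, q^{t} \bigr) \longrightarrow 0,
\qquad\text{so}\qquad
\varphi_t(-h) \to 1
\quad\text{and}\quad
\sum_{t \geq 1} \varphi_t(-h) = \infty.
```

Under this assumption the terms of the series do not even tend to zero, so the bound on the union of the events $B(t)^c$ is vacuous.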



It is also worth mentioning that the results of Tilak et al. (2011), for $\lambda_t$ as
in (1), can be applied to show that $\pi_1(t) \to 1$ with probability 1 as
$(t, \lambda_t) \to (\infty, 0)$. This, of course, immediately implies
$\varepsilon$-optimality in the sense of Definition 1. However, this indirect argument
does not give any insight as to how to bound the number of iterations needed to be
sufficiently close to convergence, as we do, albeit conservatively, in Appendix B. But
the result proved in Tilak et al. (2011) that $\hat d_1(t)$ is largest among the
$\hat d_i(t)$'s infinitely often with probability 1, together with the formula in (10) in
Appendix B, can perhaps be used to reason towards an almost sure rate of convergence for
pursuit learning.
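A Proof of Lemma 1

To start, write $\zeta(t)$ as

$$\zeta(t) = \bigl[f(t^b)\bigr]^{t^{1-b}}, \quad \text{where } f(t) = \Bigl(1 - \frac{a}{t}\Bigr)^{t};$$

then ordinary calculus reveals that

$$\frac{d}{dt} \ln f(t) = \ln\Bigl(1 - \frac{a}{t}\Bigr) + \frac{a}{t - a}
= -\ln\Bigl(1 + \frac{a}{t - a}\Bigr) + \frac{a}{t - a} > 0.$$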
Therefore, we have shown that $\ln f(t)$ and, hence, $f(t)$ and, hence, $f(t^b)$ are
monotone increasing. Moreover, $f(t^b) \uparrow e^{-a} < 1$. Thus,
$\zeta(t) \leq \exp\{-a t^{1-b}\}$. So to show that $\sum_{t=1}^{\infty} \zeta(t)$ is
finite, it suffices to show that, for $c = 1 - b$,

$$\int_1^{\infty} e^{-a t^{c}}\, dt < \infty.$$

Making a change-of-variable $x = t^c$, the integral becomes

$$\int_1^{\infty} e^{-a t^{c}}\, dt = \int_1^{\infty} \frac{1}{c\, x^{1 - 1/c}}\, e^{-ax}\, dx
= \frac{1}{c} \int_1^{\infty} x^{1/c - 1} e^{-ax}\, dx.$$

Since $1/c > 1$ and $a > 0$, the integral is finite, completing the proof.

Making one more change-of-variables ($y = ax$), one finds that the last integral above
can be expressed as

$$\int_1^{\infty} e^{-a t^{c}}\, dt = \frac{1}{c\, a^{1/c}} \int_a^{\infty} y^{1/c - 1} e^{-y}\, dy
= \frac{1}{c\, a^{1/c}}\, \Gamma(1/c, a),$$

where $\Gamma(s, x) = \int_x^{\infty} u^{s-1} e^{-u}\, du$ is the incomplete gamma
function.



B A bound on the number of iterations 

As a follow-up to the proof of Theorem 1, we give a conservative upper bound on the
number of iterations $T^*$ needed to be sufficiently close to convergence.

Theorem 2. For given $\varepsilon, \delta \in (0, 1)$ and $\theta \in (e^{-1}, 1)$, a
deterministic bound on the necessary number of iterations $T^*$ in Theorem 1 can be found
numerically.

In the proof that follows, we are assuming $h$ to be a known constant, while it actually
depends on the $\eta$ used above which, in turn, depends on the unknown $d_i$'s. The
bounds obtained in Rajaraman and Sastry (1996) also depend on $\eta$, called the size of
the problem. To use this bound in practice, users must estimate $\eta$ by some other
means.

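Proof of Theorem 2. As stated above, the desired $T^*$ is actually a sum
$T_\varepsilon^* + T_\delta^*$. Let's begin with $T_\delta^*$, the smallest
$T = T(\delta)$ such that $\sum_{t \geq T} \varphi_t(-h) < \delta$. From the proof of the
classical integral test for convergence of infinite series in calculus, it follows that

$$\sum_{t \geq T} \varphi_t(-h) \leq \varphi_T(-h) + \int_T^{\infty} \varphi_t(-h)\, dt.$$

A modification of the argument presented in Appendix A shows that

$$\int_T^{\infty} \varphi_t(-h)\, dt \leq \frac{b}{a^{b}}\, \Gamma(b;\, a T^{1/b}),$$

where $a = \pi_1(0)(1 - e^{-h})$, $b = (1 + \ln\theta)^{-1}$, and $\Gamma(s, x)$ is the
incomplete gamma function (defined in Appendix A). Since $\varphi_T(-h)$ and
$\Gamma(b;\, a T^{1/b})$ are both decreasing functions of $T$, it is possible to solve
the equation

$$\varphi_T(-h) + \frac{b}{a^{b}}\, \Gamma(b;\, a T^{1/b}) = \delta$$

for $T$ numerically to obtain the bound $T_\delta^*$ in terms of the user-specified
inputs.

Towards bounding $T_\varepsilon^*$ we note that $\hat d_1(t + T_\delta^*)$ is the largest
of the $\hat d_i$'s for all $t \geq 1$ for sample paths in a set $\Omega$ of probability
$> 1 - \delta$. For sample paths in this $\Omega$, Tilak et al. (2011) prove that

$$\pi_1(t + T_\delta^*) = 1 - \theta^{\gamma(t + T_\delta^*) - \gamma(T_\delta^*)}\,\{1 - \pi_1(T_\delta^*)\}, \quad t \geq 1, \tag{10}$$

where $\gamma(t) = \sum_{s=1}^{t} s^{-1}$. Since $\pi_1(T_\delta^*) \in (0, 1)$, it
easily follows that

$$1 - \pi_1(t + T_\delta^*) < \theta^{\gamma(t + T_\delta^*) - \gamma(T_\delta^*)}, \quad t \geq 1.$$

Given $\varepsilon$ and $T_\delta^*$, it is easy to calculate $T_\varepsilon^*$ such that
$\theta^{\gamma(t + T_\delta^*) - \gamma(T_\delta^*)} < \varepsilon$ for all
$t > T_\varepsilon^*$. □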
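The last step of the proof can be sketched numerically. The code below assumes the bound takes the form $\theta^{\gamma(t+T) - \gamma(T)}$ with $\gamma$ the harmonic sum and $T$ playing the role of the first-stage bound, and it searches for the first $t$ at which this quantity drops below $\varepsilon$. The function names and the input values (`T_delta = 100`, `eps = 0.05`, `theta = 0.5`) are hypothetical and chosen only for illustration.

```python
def harmonic(n):
    """gamma(n) = 1 + 1/2 + ... + 1/n, with gamma(0) = 0."""
    return sum(1.0 / s for s in range(1, n + 1))


def iterations_after(T_delta, eps, theta):
    """Smallest t >= 1 with theta**(gamma(T_delta + t) - gamma(T_delta)) < eps."""
    exponent = 0.0  # gamma(T_delta + t) - gamma(T_delta), built up term by term
    t = 0
    while theta ** exponent >= eps:
        t += 1
        exponent += 1.0 / (T_delta + t)
    return t


T_delta, eps, theta = 100, 0.05, 0.5
T_eps = iterations_after(T_delta, eps, theta)
```

Because the exponent increases with $t$ and $\theta < 1$, the bound stays below $\varepsilon$ for every later iteration once it has crossed the threshold. The search always terminates since the harmonic sum diverges, although the number of iterations it returns grows roughly in proportion to the first-stage bound.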